Evaluating Classification Systems Against Soft Labels with Fuzzy Precision and Recall
Classification systems are normally trained by minimizing the cross-entropy
between system outputs and reference labels, which makes the Kullback-Leibler
divergence a natural choice for measuring how closely the system can follow the
data. Precision and recall provide another perspective for measuring the
performance of a classification system. Non-binary references can arise from
various sources, and it is often beneficial to use the soft labels for training
instead of the binarized data. However, the existing definitions for precision
and recall require binary reference labels, and binarizing the data can cause
erroneous interpretations. We present a novel method to calculate precision,
recall and F-score without quantizing the data. The proposed metrics extend the
well-established metrics, as the definitions coincide when used with binary
labels. To understand the behavior of the metrics we show simple example cases
and an evaluation of different sound event detection models trained on real
data with soft labels. Comment: published in DCASE 202
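As a sketch of how such soft-label metrics can behave, the following uses the elementwise minimum as the fuzzy intersection. This is one plausible extension (the paper's exact definitions may differ), and it reduces to the classical precision and recall on binary labels:

```python
def fuzzy_precision_recall_f1(pred, ref):
    """Precision/recall/F1 for soft labels in [0, 1].

    Uses the elementwise minimum as the fuzzy intersection, so with
    binary inputs this coincides with the classical definitions.
    """
    tp = sum(min(p, r) for p, r in zip(pred, ref))  # fuzzy true positives
    precision = tp / sum(pred)
    recall = tp / sum(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# With binary labels this coincides with the classical metrics
# (precision, recall and F1 are all 0.5 for this input).
print(fuzzy_precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0]))
```

No binarization of `pred` or `ref` is needed, which is the point of the metric; soft inputs such as `[0.8, 0.3, 0.0]` are handled directly.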
Incremental Learning of Acoustic Scenes and Sound Events
In this paper, we propose a method for incremental learning of two distinct
tasks over time: acoustic scene classification (ASC) and audio tagging (AT). We
use a simple convolutional neural network (CNN) model as an incremental learner
to solve the tasks. Generally, incremental learning methods catastrophically
forget the previous task when sequentially trained on a new task. To alleviate
this problem, we propose independent learning and knowledge distillation (KD)
between the timesteps in learning. Experiments are performed on TUT 2016/2017
dataset, containing 4 acoustic scene classes and 25 sound event classes. The
proposed incremental learner first solves the ASC task with an accuracy of
94.0%. Next, it learns to solve the AT task with an F1 score of 54.4%. At the
same time, its performance on the previous ASC task decreases only by 5.1
percentage points due to the additional learning of the AT task. Comment: Accepted to DCASE2023 Workshop
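The knowledge distillation component can be illustrated with a minimal sketch: a cross-entropy between temperature-softened outputs of the frozen previous-task model (teacher) and the current model (student). This is the generic KD loss, not necessarily the paper's exact formulation:

```python
import math

def softmax(logits, T):
    """Temperature-scaled softmax; higher T gives softer distributions."""
    m = max(logits)
    exps = [math.exp((x - m) / T) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy between softened teacher and student outputs;
    minimized when the student reproduces the teacher's distribution,
    which discourages forgetting of the previous task."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_t, p_s))
```

The loss attains its minimum (the teacher's entropy) when the student matches the teacher exactly, and grows as the student drifts away.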
What is the ground truth? Reliability of multi-annotator data for audio tagging
Crowdsourcing has become a common approach for annotating large amounts of
data. It has the advantage of harnessing a large workforce to produce large
amounts of data in a short time, but comes with the disadvantage of employing
non-expert annotators with different backgrounds. This raises the problem of
data reliability, in addition to the general question of how to combine the
opinions of multiple annotators in order to estimate the ground truth. This
paper presents a study of the annotations and annotators' reliability for audio
tagging. We adapt the use of Krippendorff's alpha and multi-annotator competence
estimation (MACE) for a multi-labeled data scenario, and present how MACE can
be used to estimate a candidate ground truth based on annotations from
non-expert users with different levels of expertise and competence. Comment: submitted to EUSIPCO 202
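As a toy illustration of competence-based aggregation: MACE itself learns annotator competences jointly with the labels via a probabilistic model and EM, but the effect can be sketched with fixed (hypothetical) competence weights on a single binary tag:

```python
def weighted_vote(labels, competences):
    """Aggregate one binary tag from several annotators, weighting each
    vote by an (assumed known) annotator competence in [0, 1].
    Real MACE instead infers the competences from the annotations."""
    score = sum(c if y else -c for y, c in zip(labels, competences))
    return score > 0
```

With such weights, a single competent annotator can outweigh a careless majority, which is exactly why competence estimation matters for non-expert crowds.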
Singing Voice Recognition for Music Information Retrieval
This thesis proposes signal processing methods for analysis of singing voice audio signals, with the objectives of obtaining information about the identity and lyrics content of the singing. Two main topics are presented, singer identification in monophonic and polyphonic music, and lyrics transcription and alignment. The information automatically extracted from the singing voice is meant to be used for applications such as music classification, sorting and organizing music databases, music information retrieval, etc.
For singer identification, the thesis introduces methods from general audio classification and specific methods for dealing with the presence of accompaniment. The emphasis is on singer identification in polyphonic audio, where the singing voice is present along with musical accompaniment. The presence of instruments is detrimental to voice identification performance, and eliminating the effect of instrumental accompaniment is an important aspect of the problem. The study of singer identification is centered around the degradation of classification performance in the presence of instruments, and the separation of the vocal line for improving performance. For the study, monophonic singing was mixed with instrumental accompaniment at different signal-to-noise (singing-to-accompaniment) ratios, and the classification process was performed both on the polyphonic mixture and on the vocal line separated from it. The classification method that includes a vocal separation step significantly improves performance compared to classifying the polyphonic mixtures directly, although it does not reach the performance obtained on monophonic singing itself. Nevertheless, the results show that singing voice classification can be performed robustly in polyphonic music when source separation is used.
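The mixing setup described above can be sketched as follows; `mix_at_snr` is a hypothetical helper, not code from the thesis, that scales the accompaniment so the singing-to-accompaniment power ratio matches a target value in dB:

```python
import math

def mix_at_snr(vocals, accompaniment, snr_db):
    """Scale the accompaniment so that the singing-to-accompaniment
    power ratio equals snr_db, then sum the two signals sample-wise."""
    p_v = sum(x * x for x in vocals) / len(vocals)          # vocal power
    p_a = sum(x * x for x in accompaniment) / len(accompaniment)
    gain = math.sqrt(p_v / (p_a * 10 ** (snr_db / 10)))     # scaling factor
    return [v + gain * a for v, a in zip(vocals, accompaniment)]
```

Sweeping `snr_db` from high (vocals dominate) to low (accompaniment dominates) reproduces the kind of degradation curve studied for classification performance.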
In the problem of lyrics transcription, the thesis introduces the general speech recognition framework and various adjustments that can be made before applying the methods to singing voice. The variability of phonation in singing poses a significant challenge to the speech recognition approach. The thesis proposes using phoneme models trained on speech data and adapted to singing voice characteristics for the recognition of phonemes and words from a singing voice signal. Language models and adaptation techniques are an important aspect of the recognition process. There are two different ways of recognizing the phonemes in the audio: one is alignment, where the true transcription is known and the phonemes only have to be located; the other is recognition, where both the transcription and the locations of the phonemes have to be found. Alignment is, obviously, a simplified form of the recognition task.
Alignment of textual lyrics to music audio is performed by aligning the phonetic transcription of the lyrics with the vocal line separated from the polyphonic mixture, using a collection of commercial songs. The word recognition is tested for transcription of lyrics from monophonic singing. The performance of the proposed system for automatic alignment of lyrics and audio is sufficient for facilitating applications such as automatic karaoke annotation or song browsing. The word recognition accuracy of the lyrics transcription from singing is quite low, but it is shown to be useful in a query-by-singing application, for performing a textual search based on the words recognized from the query. When some key words in the query are recognized, the song can be reliably identified.
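The distinction between alignment and recognition can be made concrete with a minimal forced-alignment sketch: given per-frame scores for a known phoneme sequence, a Viterbi-style dynamic program decides where each phoneme starts and ends. This is a generic illustration under simplified assumptions, not the actual system from the thesis:

```python
def force_align(frame_scores, n_states):
    """Monotonic forced alignment: assign each frame to one of n_states
    (the phonemes of the known transcription, in order), allowing only
    'stay' or 'advance by one' transitions. frame_scores[t][s] is the
    log-score of state s at frame t; requires len(frame_scores) >= n_states."""
    T, NEG = len(frame_scores), float("-inf")
    dp = [[NEG] * n_states for _ in range(T)]
    back = [[0] * n_states for _ in range(T)]
    dp[0][0] = frame_scores[0][0]          # must start in the first phoneme
    for t in range(1, T):
        for s in range(n_states):
            stay = dp[t - 1][s]
            move = dp[t - 1][s - 1] if s > 0 else NEG
            best, prev = (move, s - 1) if move > stay else (stay, s)
            dp[t][s] = best + frame_scores[t][s]
            back[t][s] = prev
    path = [n_states - 1]                  # must end in the last phoneme
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```

Full recognition would additionally have to search over which phonemes occur at all, typically constrained by a language model, which is why it is the harder task.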
Binaural Signal Representations for Joint Sound Event Detection and Acoustic Scene Classification
Sound event detection (SED) and acoustic scene classification (ASC) are two widely researched audio tasks that constitute an important part of research on acoustic scene analysis. Considering shared information between sound events and acoustic scenes, performing both tasks jointly is a natural part of a complex machine listening system. In this paper, we investigate the usefulness of several spatial audio features in training a joint deep neural network (DNN) model performing SED and ASC. Experiments are performed for two different datasets containing binaural recordings and synchronous sound event and acoustic scene labels to analyse the differences between performing SED and ASC separately or jointly. The presented results show that the use of specific binaural features, mainly the Generalized Cross Correlation with Phase Transform (GCC-phat) and sines and cosines of phase differences, results in a better-performing model in both separate and joint tasks as compared with baseline methods based on logmel energies only.
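GCC-PHAT itself is a standard feature: the cross-power spectrum of the two channels is normalized to unit magnitude, keeping only phase information, before transforming back to the lag domain. A naive DFT-based sketch (real implementations use an FFT over short frames):

```python
import cmath

def gcc_phat(x, y):
    """GCC-PHAT time-delay estimate between two channels, using a naive
    O(n^2) DFT for clarity. Returns the lag, in samples, at which x best
    matches a shifted copy of y."""
    n = len(x) + len(y) - 1
    def dft(sig):
        return [sum(s * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, s in enumerate(sig)) for k in range(n)]
    X, Y = dft(x), dft(y)
    cross = [a * b.conjugate() for a, b in zip(X, Y)]
    # Phase transform: normalize each bin to unit magnitude (phase only).
    cross = [c / abs(c) if abs(c) > 1e-12 else 0j for c in cross]
    cc = [sum(c * cmath.exp(2j * cmath.pi * k * t / n)
              for k, c in enumerate(cross)).real / n for t in range(n)]
    lag = max(range(n), key=lambda t: cc[t])
    return lag if lag <= n // 2 else lag - n   # wrap to negative lags
```

The estimated lag between the left and right channels encodes source direction, which is the spatial cue such binaural features make available to the joint model.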
Audiovisual Transformer Architectures for Large-Scale Classification and Synchronization of Weakly Labeled Audio Events
We tackle the task of environmental event classification by drawing
inspiration from the transformer neural network architecture used in machine
translation. We modify this attention-based feedforward structure in such a way
that allows the resulting model to use audio as well as video to compute sound
event predictions. We perform extensive experiments with these adapted
transformers on an audiovisual data set, obtained by appending relevant visual
information to an existing large-scale weakly labeled audio collection. The
employed multi-label data contains clip-level annotation indicating the
presence or absence of 17 classes of environmental sounds, and does not include
temporal information. We show that the proposed modified transformers strongly
improve upon previously introduced models and in fact achieve state-of-the-art
results. We also make a compelling case for devoting more attention to research
in multimodal audiovisual classification by proving the usefulness of visual
information for the task at hand, namely audio event recognition. In addition,
we visualize internal attention patterns of the audiovisual transformers and in
doing so demonstrate their potential for performing multimodal synchronization.
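The core operation of such transformer models is scaled dot-product attention. A minimal single-head, single-query sketch (generic, not the paper's specific audiovisual architecture):

```python
import math

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector: a
    softmax-weighted average of the value vectors, where the weights
    come from query-key similarity."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    m = max(scores)                      # subtract max for stability
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    w = [x / z for x in w]
    dim = len(values[0])
    return [sum(wi * v[j] for wi, v in zip(w, values)) for j in range(dim)]
```

In a multimodal setting the keys and values can come from one modality (e.g. video frames) while the query comes from another (e.g. audio), and inspecting the resulting weights is what makes the synchronization visualizations described above possible.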
Diversity and bias in audio captioning datasets
Describing soundscapes in sentences allows better understanding of the acoustic scene than a single label indicating the acoustic scene class or a set of audio tags indicating the sound events active in the audio clip. In addition, the richness of natural language allows a range of possible descriptions for the same acoustic scene. In this work, we address the diversity obtained when collecting descriptions of soundscapes using crowdsourcing. We study how much the collection of audio captions can be guided by the instructions given in the annotation task, by analysing the possible bias introduced by auxiliary information provided in the annotation process. Our study shows that even when given hints on the audio content, different annotators describe the same soundscape using different vocabulary. In automatic captioning, hints provided as audio tags represent grounding textual information that facilitates guiding the captioning output towards specific concepts. We also release a new dataset of audio captions and audio tags produced by multiple annotators for a subset of the TAU Urban Acoustic Scenes 2018 dataset, suitable for studying guided captioning.
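Vocabulary overlap between annotators' captions can be quantified with something as simple as a Jaccard index over word sets; this is a generic illustration of the idea, not the analysis method used in the paper:

```python
def vocab_jaccard(caption_a, caption_b):
    """Jaccard overlap of the word sets of two captions: 1.0 means the
    annotators used identical vocabulary, 0.0 means nothing in common."""
    a = set(caption_a.lower().split())
    b = set(caption_b.lower().split())
    return len(a & b) / len(a | b)
```

Low overlap across annotators describing the same clip is one concrete signature of the caption diversity discussed above.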